Census data is central to situating an analysis, whether we use it directly, mix it with our own data, or just use it to calibrate external data we have.
In this workshop we’ll explore how to work with census data and use it in conjunction with our own data.
Census data offers rich variables at high spatial resolution, but at coarse time intervals.
Richer data comes at a price: Data discovery and acquisition is more complex. Enter CensusMapper.
CensusMapper is a flexible census data mapping platform. Anyone can explore and map census data.
CensusMapper is also an API server that facilitates data acquisition for analysis, and it includes a GUI data selection tool.
The cancensus R package interfaces with the CensusMapper API server. It can be queried for:
census geographies
census data
hierarchical metadata of census variables
some non-census data that comes on census geographies, e.g. T1FF taxfiler data
One slight complication: the cancensus package needs an API key. You can sign up for one on CensusMapper and install it with the set_cancensus_api_key function (set_api_key in older releases of the package), using the install = TRUE option so it’s always available and you won’t expose your API key when sharing code.
This code is commented-out for now because you only need to run it once. When you get your API key, place it above and run that section of code without the # sign. It should look something like this:
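As a hedged sketch of that one-time setup, assuming a recent cancensus release in which the function is named set_cancensus_api_key (the key string is a placeholder, not a real key):

```r
library(cancensus)

# Run once after signing up on CensusMapper, then comment it out again.
# "<your API key>" is a placeholder; paste your own key in its place.
# install = TRUE writes the key to .Renviron so later sessions can find it.
# set_cancensus_api_key("<your API key>", install = TRUE)
```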
This will install the API key as an environment variable in your .Renviron file, so that it’s available in every R session and you won’t expose your API key when sharing code.
Setting a local cache path stores any census data you grab on your own computer, so that you don’t have to keep re-downloading it every time. It’s really helpful for speeding up code when you’re working with bigger tables, especially ones that you use often.
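A minimal sketch of the cache setup, assuming a recent cancensus release in which the helper is named set_cancensus_cache_path. The folder is the example path used in this workshop, so substitute your own:

```r
library(cancensus)

# Run once. The folder below is only an example; any writable folder works.
# install = TRUE also records the path in .Renviron for future sessions.
set_cancensus_cache_path("~/Econ/Cancensus Cache", install = TRUE)
```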
[1] "~/Econ/Cancensus Cache"
You’ll see a use_cache argument in most of the functions we use next. It determines whether the code uses data from your local cache. It’s set to TRUE by default, so data already stored on your computer is used when it’s available.
To force cancensus to refresh the data and re-download it through the CensusMapper API, you can specify use_cache = FALSE as a parameter for the functions we’ll learn about next.
cancensus provides three different ways of retrieving Census data. Note that the calls shown in this workshop all go through get_census and differ only in the geo_format argument:
get_census_data to retrieve Census data only as a flat data frame. In the call below this is get_census with geo_format = NA:
census_data <- get_census(dataset='CA21', regions=list(CMA="48835"), vectors=c("v_CA21_434"), level='CSD', use_cache = FALSE, geo_format = NA, quiet = TRUE)
summary(census_data)
GeoUID Type Region Name Area (sq km)
Length:34 CSD:34 Alexander 134 (IRI): 1 Min. : 0.1922
Class :character Beaumont (CY) : 1 1st Qu.: 2.1634
Mode :character Betula Beach (SV) : 1 Median : 10.3084
Bon Accord (T) : 1 Mean : 276.9466
Bruderheim (T) : 1 3rd Qu.: 50.6092
Calmar (T) : 1 Max. :2502.5902
(Other) :28
Population Dwellings Households CD_UID
Min. : 18 Min. : 5.0 Min. : 5.0 Length:34
1st Qu.: 355 1st Qu.: 287.2 1st Qu.: 147.8 Class :character
Median : 1643 Median : 600.0 Median : 566.5 Mode :character
Mean : 41709 Mean : 17339.8 Mean : 16136.0
3rd Qu.: 19544 3rd Qu.: 7398.2 3rd Qu.: 7003.2
Max. :1010899 Max. :428857.0 Max. :396404.0
PR_UID CMA_UID
Length:34 Length:34
Class :character Class :character
Mode :character Mode :character
v_CA21_434: Occupied private dwellings by structural type of dwelling data
Min. : 25.0
1st Qu.: 376.2
Median : 1055.0
Mean : 19590.7
3rd Qu.: 7956.2
Max. :396400.0
NA's :6
get_census_geometry to retrieve Census geography only as a collection of spatial polygons. The call below uses get_census with geo_format = 'sf', which returns the geography together with the data as an sf object:
census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
  vectors=c("v_CA21_434"),
  level='CSD', use_cache = FALSE, geo_format = 'sf', quiet = TRUE)
summary(census_data)
CMA_UID CD_UID name Dwellings 2016
Length:34 Length:34 Length:34 Min. : 5
Class :character Class :character Class :character 1st Qu.: 308
Mode :character Mode :character Mode :character Median : 604
Mean : 15813
3rd Qu.: 6719
Max. :388254
Population Dwellings PR_UID Population 2016
Min. : 18 Min. : 5.0 Length:34 Min. : 10.0
1st Qu.: 355 1st Qu.: 287.2 Class :character 1st Qu.: 301.5
Median : 1643 Median : 600.0 Mode :character Median : 1641.0
Mean : 41709 Mean : 17339.8 Mean : 38865.9
3rd Qu.: 19544 3rd Qu.: 7398.2 3rd Qu.: 17390.0
Max. :1010899 Max. :428857.0 Max. :933088.0
Households Type GeoUID Households 2016
Min. : 5.0 CSD:34 Length:34 Min. : 5.0
1st Qu.: 147.8 Class :character 1st Qu.: 126.8
Median : 566.5 Mode :character Median : 534.0
Mean : 16136.0 Mean : 14769.1
3rd Qu.: 7003.2 3rd Qu.: 6394.2
Max. :396404.0 Max. :361033.0
Quality Flags Shape Area Region Name
Length:34 Min. : 0.1922 Alexander 134 (IRI): 1
Class :character 1st Qu.: 2.1634 Beaumont (CY) : 1
Mode :character Median : 10.3084 Betula Beach (SV) : 1
Mean : 276.9466 Bon Accord (T) : 1
3rd Qu.: 50.6092 Bruderheim (T) : 1
Max. :2502.5902 Calmar (T) : 1
(Other) :28
Area (sq km)
Min. : 0.1922
1st Qu.: 2.1634
Median : 10.3084
Mean : 276.9466
3rd Qu.: 50.6092
Max. :2502.5902
v_CA21_434: Occupied private dwellings by structural type of dwelling data
Min. : 25.0
1st Qu.: 376.2
Median : 1055.0
Mean : 19590.7
3rd Qu.: 7956.2
Max. :396400.0
NA's :6
geometry
MULTIPOLYGON :34
epsg:4326 : 0
+proj=long...: 0
get_census is used to retrieve Census data and geography together as a spatial dataset:
census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
  vectors=c("v_CA21_434"),
  level='CSD', use_cache = FALSE, geo_format = 'sp', quiet = TRUE)
head(census_data) %>% knitr::kable()

| CMA_UID | CD_UID | name | Dwellings.2016 | Population | Dwellings | PR_UID | Population.2016 | Households | Type | GeoUID | Households.2016 | Quality.Flags | Shape.Area | Region.Name | Area..sq.km. | v_CA21_434..Occupied.private.dwellings.by.structural.type.of.dwelling.data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48835 | 4810 | Bruderheim (T) | 629 | 1329 | 552 | 48 | 1323 | 515 | CSD | 4810066 | 507 | 0 | 9.2781 | Bruderheim (T) | 9.2781 | 515 |
| 48835 | 4811 | Leduc County (MD) | 5621 | 14416 | 5990 | 48 | 13177 | 5295 | CSD | 4811012 | 4875 | 0 | 2502.5902 | Leduc County (MD) | 2502.5902 | 5295 |
| 48835 | 4811 | Beaumont (CY) | 6015 | 20888 | 7168 | 48 | 17457 | 6950 | CSD | 4811013 | 5654 | 0 | 24.7019 | Beaumont (CY) | 24.7019 | 6950 |
| 48835 | 4811 | Leduc (CY) | 12264 | 34094 | 13507 | 48 | 29993 | 12964 | CSD | 4811016 | 11319 | 0 | 42.2532 | Leduc (CY) | 42.2532 | 12960 |
| 48835 | 4811 | Devon (T) | 2493 | 6545 | 2588 | 48 | 6578 | 2496 | CSD | 4811018 | 2415 | 0 | 14.2554 | Devon (T) | 14.2554 | 2495 |
| 48835 | 4811 | Calmar (T) | 861 | 2183 | 937 | 48 | 2228 | 893 | CSD | 4811019 | 842 | 0 | 4.6686 | Calmar (T) | 4.6686 | 895 |
Cancensus can access Statistics Canada Census data for Census years 1996, 2001, 2006, 2011, 2016, and 2021. You can run list_census_datasets to check what datasets are currently available for access through the CensusMapper API.
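The listing that follows was presumably produced by a call along these lines (a minimal sketch, assuming your API key is already set up):

```r
library(cancensus)

# Lists every dataset the CensusMapper API currently serves
list_census_datasets()
```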
# A tibble: 29 × 6
dataset description geo_dataset attribution reference reference_url
<chr> <chr> <chr> <chr> <chr> <chr>
1 CA1996 1996 Canada Census CA1996 StatCan 19… 92-351-U https://www1…
2 CA01 2001 Canada Census CA01 StatCan 20… 92-378-X https://www1…
3 CA06 2006 Canada Census CA06 StatCan 20… 92-566-X https://www1…
4 CA11 2011 Canada Census a… CA11 StatCan 20… 98-301-X… https://www1…
5 CA16 2016 Canada Census CA16 StatCan 20… 98-301-X https://www1…
6 CA21 2021 Canada Census CA21 StatCan 20… 98-301-X https://www1…
7 CA01xSD 2001 Canada Census x… CA01 StatCan 20… 92-378-X https://www1…
8 CA06xSD 2006 Canada Census x… CA06 StatCan 20… 92-566-X https://www1…
9 CA11xSD 2011 Canada Census x… CA11 StatCan 20… 98-301-X https://www1…
10 CA16xSD 2016 Canada Census x… CA16 StatCan 20… 98-301-X https://www1…
# ℹ 19 more rows
Census data is aggregated at multiple geographic levels. Census geographies at the national (C), provincial (PR), census metropolitan area (CMA), census agglomeration (CA), census division (CD), and census subdivision (CSD) are defined as named census regions.
Canadian Census geography can change in between Census periods. Cancensus provides a function, list_census_regions(dataset), to display all named census regions and their corresponding id for a given census dataset.
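As a sketch, the two listings that follow can be reproduced with calls like these. The second one filters the full listing down to a single CMA, which is a convenient way to read off a region code (dplyr is assumed to be installed for filter and the pipe):

```r
library(cancensus)
library(dplyr)

# All named regions in the 2021 dataset
list_census_regions("CA21")

# Narrow the listing to one CMA to read off its region code
list_census_regions("CA21") %>%
  filter(level == "CMA", name == "Edmonton")
```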
# A tibble: 5,518 × 8
region name level pop municipal_status CMA_UID CD_UID PR_UID
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 01 Canada C 3.70e7 <NA> <NA> <NA> <NA>
2 35 Ontario PR 1.42e7 Ont. <NA> <NA> <NA>
3 24 Quebec PR 8.50e6 Que. <NA> <NA> <NA>
4 59 British Columbia PR 5.00e6 B.C. <NA> <NA> <NA>
5 48 Alberta PR 4.26e6 Alta. <NA> <NA> <NA>
6 46 Manitoba PR 1.34e6 Man. <NA> <NA> <NA>
7 47 Saskatchewan PR 1.13e6 Sask. <NA> <NA> <NA>
8 12 Nova Scotia PR 9.69e5 N.S. <NA> <NA> <NA>
9 13 New Brunswick PR 7.76e5 N.B. <NA> <NA> <NA>
10 10 Newfoundland and … PR 5.11e5 N.L. <NA> <NA> <NA>
# ℹ 5,508 more rows
# A tibble: 1 × 8
region name level pop municipal_status CMA_UID CD_UID PR_UID
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 48835 Edmonton CMA 1418118 B <NA> <NA> 48
This is how we found the region code 48835 used in the get_census calls above. To grab the same data for Edmonton at a finer geographic level, we only need to change the level argument.
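For instance, a hedged sketch that swaps census subdivisions for census tracts. The object name edmonton_ct is made up for illustration; everything else mirrors the earlier calls:

```r
library(cancensus)

# Same region and vector as the earlier calls; only `level` changes.
# Census tracts (CT) are a finer unit than census subdivisions (CSD).
edmonton_ct <- get_census(dataset = "CA21",
                          regions = list(CMA = "48835"),
                          vectors = c("v_CA21_434"),
                          level = "CT",
                          geo_format = NA,
                          quiet = TRUE)
head(edmonton_ct)
```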
Census data contains thousands of different geographic regions as well as thousands of unique variables. In addition to enabling programmatic and reproducible access to Census data, cancensus has a number of tools to help users find the data they are looking for.
You can run the following code to view all available Census variables for a given dataset:
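A minimal sketch of that call for the 2021 dataset, which is the one the listing that follows appears to come from:

```r
library(cancensus)

# One row per census vector available in the 2021 dataset
list_census_vectors("CA21")
```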
# A tibble: 7,709 × 7
vector type label units parent_vector aggregation details
<chr> <fct> <chr> <fct> <chr> <chr> <chr>
1 v_CA21_1 Total Population, 2021 Numb… <NA> Additive CA 202…
2 v_CA21_2 Total Population, 2016 Numb… <NA> Additive CA 202…
3 v_CA21_3 Total Population percenta… Numb… <NA> Average of… CA 202…
4 v_CA21_4 Total Total private dwell… Numb… <NA> Additive CA 202…
5 v_CA21_5 Total Private dwellings o… Numb… v_CA21_4 Additive CA 202…
6 v_CA21_6 Total Population density … Ratio <NA> Average of… CA 202…
7 v_CA21_7 Total Land area in square… Numb… <NA> Additive CA 202…
8 v_CA21_8 Total Total - Age Numb… <NA> Additive CA 202…
9 v_CA21_9 Male Total - Age Numb… <NA> Additive CA 202…
10 v_CA21_10 Female Total - Age Numb… <NA> Additive CA 202…
# ℹ 7,699 more rows
For each variable (vector) in that Census dataset, this shows:
Vector: short variable code
Type: variables are provided as aggregates of female responses, male responses, or total (male+female) responses
Label: detailed variable name
Units: provides information about whether the variable represents a count integer, a ratio, a percentage, or a currency figure
Parent_vector: shows the immediate hierarchical parent category for that variable, where appropriate
Aggregation: indicates how the variable should be aggregated with others, whether it is additive or if it is an average of another variable
Details: a rough description of a variable based on its hierarchical structure. This is constructed by cancensus by recursively traversing the labels for every variable’s hierarchy, and facilitates searching for specific variables using key terms
As you can tell, it’s pretty hard to find a variable on your own just by browsing that list. Cancensus provides the find_census_vectors() function to help with that.
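A hedged sketch of an exact search. The search term "Australia" is an assumption, guessed from the first result table that follows; the original query is not shown in this document:

```r
library(cancensus)

# Exact search: the term has to appear literally in the vector metadata.
# type = "total" drops the separate male and female breakdowns.
find_census_vectors("Australia", dataset = "CA21",
                    type = "total", query_type = "exact")
```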
# A tibble: 3 × 4
vector type label details
<chr> <fct> <chr> <chr>
1 v_CA21_4812 Total Australia 25% Data; Citizenship and immigration; Total - P…
2 v_CA21_5223 Total Australian 25% Data; Visible minority and ethnic origin; To…
3 v_CA21_6483 Total Australia 25% Data; Education; Total - Location of study c…
# A tibble: 251 × 4
vector type label details
<chr> <fct> <chr> <chr>
1 v_CA21_4917 Total Total - Ethnic or cultural origin for the populati… 25% Da…
2 v_CA21_4920 Total Canadian 25% Da…
3 v_CA21_4923 Total English 25% Da…
4 v_CA21_4926 Total Irish 25% Da…
5 v_CA21_4929 Total Scottish 25% Da…
6 v_CA21_4932 Total French, n.o.s. 25% Da…
7 v_CA21_4935 Total German 25% Da…
8 v_CA21_4938 Total Chinese 25% Da…
9 v_CA21_4941 Total Italian 25% Da…
10 v_CA21_4944 Total Indian (India) 25% Da…
# ℹ 241 more rows
The “exact” search is very precise, but you can miss out on key tables if you don’t know exactly what you’re looking for.
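Other search options are “keyword”, which looks for the highest number of matches, and “semantic”, used in the calls below, which is meant for multi-word phrases and tolerates inexact wording. The function also has an interactive option that you can play around with: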
#list_census_regions('CA21') %>% filter(level=="CMA", name=="Edmonton")
find_census_vectors("Low Income Measures",dataset="CA21",type="total",query_type="semantic") %>% knitr::kable()

| vector | type | label | details |
|---|---|---|---|
| v_CA21_1025 | Total | In low income based on the Low-income measure, after tax (LIM-AT) | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT) |
| v_CA21_1028 | Total | 0 to 17 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years |
| v_CA21_1031 | Total | 0 to 5 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years; 0 to 5 years |
| v_CA21_1034 | Total | 18 to 64 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 18 to 64 years |
| v_CA21_1037 | Total | 65 years and over | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 65 years and over |
| v_CA21_1040 | Total | Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) |
| v_CA21_1043 | Total | 0 to 17 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years |
| v_CA21_1046 | Total | 0 to 5 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years; 0 to 5 years |
| v_CA21_1049 | Total | 18 to 64 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 18 to 64 years |
| v_CA21_1052 | Total | 65 years and over | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 65 years and over |
find_census_vectors("Low Income Measures %",dataset="CA21",type="total",query_type="semantic") %>% knitr::kable()

| vector | type | label | details |
|---|---|---|---|
| v_CA21_1025 | Total | In low income based on the Low-income measure, after tax (LIM-AT) | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT) |
| v_CA21_1028 | Total | 0 to 17 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years |
| v_CA21_1031 | Total | 0 to 5 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years; 0 to 5 years |
| v_CA21_1034 | Total | 18 to 64 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 18 to 64 years |
| v_CA21_1037 | Total | 65 years and over | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 65 years and over |
| v_CA21_1040 | Total | Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) |
| v_CA21_1043 | Total | 0 to 17 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years |
| v_CA21_1046 | Total | 0 to 5 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years; 0 to 5 years |
| v_CA21_1049 | Total | 18 to 64 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 18 to 64 years |
| v_CA21_1052 | Total | 65 years and over | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 65 years and over |
# A tibble: 1 × 8
region name level pop municipal_status CMA_UID CD_UID PR_UID
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 59933 Vancouver CMA 2642825 B <NA> <NA> 59
If we want to add different themes to the graph, there are a lot of options to choose from:
# pv is presumably a named vector of census vector IDs (with an entry named lico_at), defined in a step not shown here
poverty_data <- get_census("CA21", regions=list(CMA="59933"), vectors=pv, geo_format="sf", level="CT")
ggplot(poverty_data, aes(fill=lico_at/100)) +
  geom_sf(size=NA) +
  labs(title="% of children in poverty - Vancouver", fill=NULL, caption="StatCan Census 2021") +
  scale_fill_viridis_c(option = "inferno", labels=scales::percent)
Mixing data sources is hard, especially when dealing with spatial data.
If spatial units match across datasets, combining and comparing them is easy. If spatial units don’t match, things get complicated. And annoying.
A simple example is the Labour Force Survey (LFS) time series for CMAs. We don’t have one long time series but several partially overlapping shorter ones. The reason is that CMA geography changes over time, so we can’t directly compare data when the geography (and with it the denominators) changes.
The cansim package returns a geographic identifier GeoUID that matches census identifiers returned by cancensus. That makes matching data from those two data sources relatively easy.
We’ll try out an example here with an Income Distribution measure from StatCan:
# StatCan table 11-10-0074: keep the geographic identifier and the index value
income_distribution <- get_cansim("11-10-0074") %>% select(GeoUID,`D-index`=VALUE)
# 2016 census tracts for the Toronto CMA, with geometry
toronto <- get_census("CA16",regions=list(CMA="35535"),geo_format = 'sf',level="CT")
# GeoUID is shared by both sources, so a plain join lines them up
merged_data <- left_join(toronto,income_distribution, by="GeoUID")
merged_data %>%
  ggplot(aes(fill=`D-index`)) +
  geom_sf(size=0.1) + scale_fill_viridis_c() +
  coord_sf(datum=NA,xlim=c(-79.8,-79.15),ylim=c(43.6,43.8)) +
  labs(title="Income divergence index", caption="StatCan table 11-10-0074")
Census geographies often change over time, which complicates comparisons using more than one year of data.
The best way to deal with this is a custom data request, but that takes time, costs money and is overkill for many applications. An immediate way to achieve almost the same result is using the tongfen package.
Tongfen ensures that while the spatial units change, they are still comparable. In other words, they’re derived from one another by a (generally short) series of split and join operations.
With this example, we’ll take a look at how the number of children in Toronto has changed over time. We’ll use the CensusMapper API GUI to select the Census vectors that we need. We’ll get data on the children under 15, assembled from 5 year age groups for males and females for 2001 and 2021.
# A tibble: 3 × 8
region name level pop municipal_status CMA_UID CD_UID PR_UID
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 35535 Toronto CMA 6202225 B <NA> <NA> 35
2 3520 Toronto CD 2794356 CDR <NA> <NA> 35
3 3520005 Toronto CSD 2794356 C 35535 3520 35
# "meta" vectors for Tongfen are the Census vectors that we need to compare over time
meta <- meta_for_ca_census_vectors(c(children_2021="v_CA21_11",
                                     children1m_2001="v_CA01_7",
                                     children2m_2001="v_CA01_8",
                                     children3m_2001="v_CA01_9",
                                     children1f_2001="v_CA01_26",
                                     children2f_2001="v_CA01_27",
                                     children3f_2001="v_CA01_28"))
Since the 2001 Census reports these counts in 5-year age groups by sex rather than as a single under-15 total, we’ll need to add them up to compare them with the 2021 Census data.
The code at the end gets rid of all the extra columns we had, since the original table was pretty clunky. The first part of the pattern passed to matches() keeps only columns that start with “children_” and have exactly 4 digits at the end (e.g. children_2021). The second part keeps the columns that have “Population” in their name.
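To make the two steps concrete, here is a small offline sketch in base R. The toy table, its values and the exact regular expression are all assumptions made for illustration; only the column naming follows the meta vectors defined above.

```r
# Toy table standing in for the tongfen result; all values are made up
toy <- data.frame(
  children_2021   = c(120, 80),
  children1m_2001 = c(20, 10),
  children2m_2001 = c(25, 15),
  children3m_2001 = c(30, 12),
  children1f_2001 = c(18, 11),
  children2f_2001 = c(22, 14),
  children3f_2001 = c(28, 13),
  Population_CA21 = c(1000, 900),
  Population_CA01 = c(950, 850),
  Households_CA21 = c(400, 350)
)

# Add the six 2001 age-by-sex groups into one under-15 total
groups_2001 <- c("children1m_2001", "children2m_2001", "children3m_2001",
                 "children1f_2001", "children2f_2001", "children3f_2001")
toy$children_2001 <- rowSums(toy[, groups_2001])

# Keep "children_" followed by exactly four digits, or any name containing "Population"
keep <- grepl("^children_[0-9]{4}$|Population", names(toy))
toy_small <- toy[, keep]
names(toy_small)
```

With dplyr, the same pattern could be handed to select(matches(...)) instead of grepl, which applies the regular expression to the column names in the same way.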
ggplot(plot_data, aes(fill=children_2021/Population_CA21-children_2001/Population_CA01)) +
  geom_sf() +
  scale_fill_gradient2(labels=scales::percent) +
  coord_sf(datum=NA) +
  labs(title="City of Toronto change in share of children under 15 from 2001 to 2021",
       fill="Percentage\npoint\nchange",
       caption="StatCan Census 2001, 2021")
Now that we’ve gone through a lot of examples, we’ll have you all try one out on your own. In the meantime, I’ll walk you through the steps again (if you need it) and answer any questions you have. Here’s the goal:
Find one Census year and vector that you want to analyze, either with CensusMapper in the browser or with the find_census_vectors() function.
Pick a geography you want to map. You can again use CensusMapper to grab the GeoUID in the browser, or you can use the list_census_regions() function here in R.
Map the data. Choose the level of geographic detail you want with the level argument, for example level = "CT" or level = "CSD".
Good luck!